Objective assessment of tumor infiltrating lymphocytes as a prognostic marker in melanoma using machine learning algorithms

Summary Background The prognostic value of tumor-infiltrating lymphocytes (TILs) assessed by machine learning algorithms in melanoma patients has been previously demonstrated but has not been widely adopted in the clinic. We evaluated the prognostic value of objective automated electronic TILs (eTILs) quantification to define a subset of melanoma patients with a low risk of relapse after surgical treatment. Methods We analyzed data for 785 patients from 5 independent cohorts from multiple institutions to validate our previous finding that automated TIL score is prognostic in clinically-localized primary melanoma patients. Using serial tissue sections of the Yale TMA-76 melanoma cohort, both immunofluorescence and Hematoxylin-and-Eosin (H&E) staining were performed to understand the molecular characteristics of each TIL phenotype and their associations with survival outcomes. Findings Five previously-described TIL variables were each significantly associated with overall survival (p<0.0001). Assessing the receiver operating characteristic (ROC) curves by comparing the clinical impact of two models suggests that etTILs (electronic total TILs) (AUC: 0.793, specificity: 0.627, sensitivity: 0.938) outperformed eTILs (AUC: 0.77, specificity: 0.51, sensitivity: 0.938). We also found that the specific molecular subtype of cells representing TILs includes predominantly cells that are CD3+ and CD8+ or CD4+ T cells. Interpretation eTIL% and etTILs scores are robust prognostic markers in patients with primary melanoma and may identify a subgroup of stage II patients at high risk of recurrence who may benefit from adjuvant therapy. We also show the molecular correlates behind these scores. Our data support the need for prospective testing of this algorithm in a clinical trial. Funding This work was also supported by a sponsored research agreements from Navigate Biopharma and NextCure and by grants from the NIH including the Yale SPORE in in Skin Cancer, P50 CA121974, the Yale SPORE in Lung Cancer, P50 CA196530, NYU SPORE in Skin Cancer P50CA225450 and the Yale Cancer Center Support Grant, P30CA016359.


Introduction
Melanoma is considered a highly immunogenic tumor and is responsive to immunotherapy. 2 Checkpoint inhibitor immunotherapies targeting programmed cell death protein 1 (PD-1) or cytotoxic T-lymphocyte-associated protein 4 (CTLA-4) have been shown to significantly improve the overall survival of patients with advanced stage metastatic disease. 3,4 Furthermore, adjuvant treatment with immune checkpoint blockade (or BRAF pathway targeted therapy in BRAF mutant melanoma patients) improved relapse-free survival (RFS) compared to placebo in phase 3 clinical trials 5,6 and is now regarded as standard of care in high-risk stage III melanoma patients. Furthermore, it was recently reported that adjuvant anti-PD1 immunotherapy improved relapse-free survival in high-risk stage II melanoma patients. However, up to 30% of stage III patients treated with adjuvant immunotherapy developed disease recurrence. Furthermore, treatmentrelated adverse events occur in at least one in five patients, and treatment related fatalities have been reported in up to one in one hundred patients. 5,7 There is therefore an urgent need to identify patients at high risk of disease relapse who may benefit from adjuvant therapies and those patients at low risk of relapse who can be safely spared further treatment and the concomitant risks of treatment related adverse events.
Tumor-infiltrating lymphocytes (TILs) reflect the host's immune response against cancer cells. 2,8 Correlation between different TILs infiltrates and improved survival in melanoma patients has been reported in multiple studies. [9][10][11][12] Traditionally, TIL scores have been visually assessed by the pathologists, but due to the lack of reproducibility caused by subjective assessment between pathologists and institutions, TIL scoring techniques have not been broadly clinically implemented. 13 To increase objectivity, a machine learning algorithm developed using open source software, QuPath, has been shown to facilitate the investigation of complex spatial patterns by firstly classifying four cell types, including tumor cells, TILs, stromal cells, and "other" cells, for the assessment of the proportion TILs within different cell populations. 1,14 Acs et al. highlighted the robustness of the NN192 machine learning algorithm by comparing the performance between eTIL scores and pathologist TIL scores. 1 The latter study also showed that TILs score assessed by the NN192 algorithm was an independent prognostic marker in melanoma.
Here, we used the same cell classifier to validate the association of percent electronic TILs (eTIL%) with disease-specific survival in patients with melanoma in a broad set of cohorts from melanoma centers around the world. For this effort, we used the previously established cut-point of eTIL% 1 to test The Cancer Genome Atlas (TCGA) melanoma cohort and four other melanoma cohorts from Yale School of Medicine, Melanoma Institute Australia, Tubingen University, and New York University. Furthermore, we also tested both cell types and area, variables that pathologists cannot easily calculate, for their prognostic significance. The goal of this effort is not to help pathologists more accurately assess TIL, but to validate this prognostic assay for potential pathologist-independent use in future prospective trials to determine which melanoma patients might be spared immunotherapy in the adjuvant setting. We aimed to specify the most reliable operator-independent identification of eTIL%, which can be prospectively validated in future studies. Finally, we assessed the prognostic role of the immunophenotypic subtypes of TILs, as defined by CD45+, CD3+, CD4+, CD8+, CD56+, CD20+, and FOXP3+ cells, using the Yale cohort to better understand the TIL subsets/phenotypes and their associations with patients' survival. Our ultimate goal was to determine the best approach for utilizing TIL infiltrates to predict disease outcome in melanoma patients that can be validated prospectively with the goal of proving clinical utility to select the high-risk subset of melanoma patients that are likely to not require immunotherapy in the adjuvant setting.

Study population
We assessed retrospectively collected samples from 5 independent cohorts from: (1) 7 based NN192 melanoma machine learning algorithm with neural network method 1 was applied for cell classification in this study. For WTS, the tumor and a 1À2-millimeter (mm) diameter surrounding tumor microenvironment to be analysed were carefully selected for accurate prediction of TILs. The area selection was reviewed by a pathologist. Due to the varying intensity both between and within cohorts, the "estimate stain vectors", ESV, function in QuPath was used to refine the H&E stain for each digitised slide. The workflow for stain normalisation using ESV function was shown in Supplementary Figure 2. The number of cells identified as tumor and immune cells (in %) across multiple centers were shown in Supplementary  Figure 3. Cell segmentation and classification were performed using the parameters previously described. 1

Assessment of eTILs using five variables
The machine-defined TILs variables were constructed using five different methods, as previously described. 17 The first and established method was to calculate eTIL% representing the proportion of TILs over tumor cells, calculated as (TILs/TILs + tumor cells) x 100. 1    The last method mimics the manual pathologist scoring of stromal TILs according to the International Immuno-Oncology Biomarker Working Group on Breast Cancer. 18 The variables are shown schematically in Figure 3a.

Multispectral image acquisition and cell counting on serial sections YTMA-76
Image acquisition of the stained slides was performed using Vectra/Polaris (Akoya Biosciences, Marlborough, MA) microscope to obtain MSIs (multispectral images). Briefly, a low magnification scan of the whole TMA slide was performed at 4£. The regions of interest (ROI) scan were then selected from a low-resolution using the Phenochart viewer (Akoya Biosciences), and the ROIs were subsequently acquired at the higher resolution MSIs at 20£. To analyse the MSIs, the spectra were extracted from acquired images to build the spectral library consisting of all fluorophores using inForm image analysis software version 2.4.9 (Akoya Biosciences), and the absence of spectral overlap between channels was checked by evaluating the unmixed images. The acquired multispectral images were then decomposed using a spectral library. Tumor, stroma, and background were identified using the trainable tissue segmentation option in InForm. Cell segmentation within tumor and stroma regions was performed using the parameters including minimum nuclear size and splitting sensitivity and the signals of the nuclei, cytoplasm, and membrane components as individual cells.

Hematoxylin & Eosin (H & E) staining for serial sections of YTMA-76
To count total TILs in the same TMA sections where molecular subtypes of cells were determined, coverslips

Statistical analysis
Statistical analyses were performed using GraphPad Prism 9.1.0 (GraphPad Software Inc., CA, USA) and R. studio 1.4.1106 (Inc., Boston, MA). The cut-points for cell types and area variables were determined using Xtile cut-point finder. 20 KaplanÀMeier plots for diseasespecific survival (DSS) and overall survival (OS) were computed and comparisons were made by the log-rank test using survival and survminer packages in R studio. Post hoc Benjamini-Hochberg (BH) multiple comparisons test was performed when the results for each variable in survival analyses were significant. ROC curves were constructed from logistic regression models for the prediction of DSS. All statistical tests were twosided, and significance was represented as (*) p < 0.05, (**) p < 0.01, (***) p < 0.001, (****) p < 0.0001, or not significant (ns). To perform univariable analyses of each factor, a Cox proportional hazards model was fitted to predict survival from each factor in turn. For the multivariable analysis, a Cox proportional hazards model was learned using eTIL%, age, gender and stage as predictors. These variables were chosen, since they were all present in a common group of 3 cohorts. For each test, we quote the hazards ratio associated with each level of a factor compared to the base reference level, and the associated p-values.

Role of funding source
None of the funders were directly involved in the study design, data collection, analyses, interpretation, or writing of the manuscript. The corresponding author (David L Rimm) has full access to all the data and the final responsibility for the decision to submit for publication.

Measurement of eTIL% as a prognostic variable in cohorts from multiple-institutions
Assessment of eTIL% using the NN192 machine learning cell classifier and the established cut-point of 16.6 % in the five cohorts from multiple institutions showed that high eTIL% was associated with longer overall survival in TCGA cohort (hazard ratio (HR) = 0.1, p < 0.0001), better disease-specific survival in Tubingen cohort (HR = 0.31, p = 0.013), and Yale cohort (HR = 0.37, p = 0.005) (Figure 1c, 1e, 1f), but not in the NYU or Sydney cohorts (Figure 1a, 1b). However, the assessment of visual pathologist-read TILs in Yale Cohort showed no significant association with disease specific survival (Log-Rank p = 0.39, Figure 1g). Evaluation of the stage distribution of these cohorts showed that approximately 98% of patients in the NYU cohort and 43 % of patients in the Sydney cohort were stage III and IV patients, whereas stage III and IV patients in the Tubingen, Yale and TCGA cohorts were 6%, 0% and 0% respectively (Figure 1d). The results indicated that assessing eTIL% scores as a prognostic factor for patients with melanoma may be stage dependent as high eTIL% were associated with better prognosis mainly in patients with stage I and II disease. We generated a combined cohort containing 764 patients from Yale, Tubingen, Sydney, TCGA, and NYU cohorts. Univariable and multivariable analyses were performed to assess the association of eTIL% and the clinical pathological features with survival ( Table 2). eTIL% (with a predefined 16.6 % cut-point), Stage, Breslow, ulceration and histogenesis were all significantly associated with survival in univariable analyses. The multivariable analysis showed that eTIL% was a significant predictor, even when combined with the other pathological features in a single model. Further, in an analysis of stage 2 only patients of the combined cohort with the 16.6 % cut-point, the results showed that higher eTIL% is associated with better prognosis (HR = 0.45, p = 0.00068) (Figure 2a), but not stage III and IV (HR = 1.48, p = 0.14) (Figure 2b). Our results could support the previous finding that patients with eTIL% >=16.6 have a significantly better prognosis.

Assessment of five TIL variables in multi-institutional stage I and II combined cohort
To identify the best approach to measure eTILs for potential future clinical adoption, we tested five different methods to assess the densities and proportions of eTILs based on the cell types and the area analyzed. We used the TCGA cohort as a training set to find every possible cut-point and the association of each TIL variable with patient outcome. The optimal cut-points defined in the TCGA cohort for each variable, the p-values, and HRs derived for measurement of the cohorts from assessing the optimal cut-points are shown. Since the purpose of this assessment was to compare the performance of five variables, it is statistically unsound to compare the p-values of each variable but ok to compare the HRs to determine relative prognostic strength. We compared the estimated difference between the HRs, with confidence intervals (CIs), between variables, and reported the p-value evaluating the null hypothesis of no association between the prognostic variable and outcome. Our results show a significant association of all variables with OS (Figure 3b). The HRs between all five TIL variables are similar, but eTILs (HR=0.31, 95% CI (confidence interval) =0.19-0.5, p < 0.0001) and etTILs (HR=0.29, 95% CI=0.18-0.44, p < 0.0001) appeared to be more robust methods in the combined cohort (although not statistically significantly better). This indicates that eTILs and etTILs might be better performing methods than the remaining three methods in large future cohorts. This is concordant with the prognostic results with eTIL% previously reported in melanoma. 1 Figure 4b. In addition to the lymphoid lineage markers including CD3, CD4, CD8, CD20, CD45, CD56, and FOXP3, we also examined myeloid lineage markers such as CD14, CD68, CD34 to accurately identify the specific cell types within the tumor microenvironment of YTMA-76. The relationship between each molecular subtype of cell and TILs% was assessed by  (Figure 4c). These results indicate that the lymphocytic phenotypic marker most highly correlated to TILs is CD45 (leukocyte common antigen), and the specific phenotypic subtype of cells representing TILs may be CD3+ or CD8+ or CD4+ T cells. This was similarly reported previously in melanoma. 1,21 These findings were corroborated by the cell type-specific survival analyses, which showed similar profiles (Supplementary Figure 1). The KaplanÀMeier estimates of survival test using the median as a cut-point showed   that patients with high cell counts of CD3 (HR = 0.21, p = 0.015) and CD8 (HR = 0.2, p = 0.015), as well as myelocytic macrophage marker CD68 (HR = 0.33, p = 0.015) have significantly improved DSS (Figure 4d).

Identification of molecular subtypes of TILs
As accumulated evidence showed that CD3+ and CD8+ lymphocytic cell infiltration is the primary determinant of immunotherapy outcome, evaluating the combination of both TILs and a group of specific molecular subtypes of TILs in association with the clinical outcome might assist in defining a subset of patients that might respond to immunotherapy. Next, we assessed the utility of TILs and specific molecular subtypes of cells to predict event risk. We generated the area under the receiver operating characteristic (ROC) curves (C-statistic) for TILs variables such as eTILs and etTILs, and all lymphocytic markers The optimal cut-points defined in the TCGA cohort as a training set for each variable, the p-values (log-rank) and HRs with 95% CI derived for measurement of the cohorts from assessing the optimal cut-points were shown.
including CD4, CD3, CD8, CD45, CD20, FOXP3 and CD56 ( Figure 5) to measure the risk prediction associated with these markers. eTILs achieved a favorable prognostic performance where the area under the curve (AUC) value for DSS was 0.77 (CI: 0.642-0.894). Of seven markers tested for AUC, CD4 (AUC: 0.723, CI: The predefined cut-point 16.6% for eTILs, the optimal cut-point 14.3 defined in the TCGA cohort for etTILs and the median cutpoints for all other cell types were used to stratify the patients. The adjusted p-values (log-rank, adjusted by Benjamini and Hochberg, BH) and HRs with 95% CI derived for measurement of the cohorts from assessing the optimal cut-points were shown. 0.593À0.853) showed the highest prognostic performance (Figure 5a). In agreement with the prior work, 21 our results suggested that the prognostic value of TILs appeared to be driven by CD4+ T cells. The AUC of myelocytic macrophage marker CD68 was 0.678 (CI: 0.535-0.823), exhibiting a higher prognostic value. Our AUC analysis of etTILs variable predicting event risk in YTMA-76 showed that etTILs (AUC: 0.793, CI: 0.672-0.914, specificity: 0.627 and sensitivity: 0.938) outperformed eTILs (AUC: 0.77, CI: 0.642-0.897, specificity: 0.51 and sensitivity: 0.938) which validated our finding in a low-stage combined cohort from multiple institutions ( Figure 5b). Taken together, assessment of ROC curves by comparing the clinical impact of two variable models (i.e., eTILs and etTILs) suggests that etTIL% might be a more robust variable to use clinically.

Discussion
The use of immunotherapies, including Pembrolizumab and Ipilimumab in the adjuvant setting, has been shown to successfully manage stage III melanoma. 5,7 However, since more patients with stage I or II melanoma are diagnosed than stage III, 22 it is important to evaluate the role of these agents as adjuvant therapies in patients with early-stage melanoma. This is especially important because only one in five of stage III melanoma patients benefit from immunotherapy, and 50% are cured by surgery alone as seen in the placebo arm. 5 These data suggest that we could identify the subset of patients that could potentially be spared immunotherapy toxicity since they are unlikely to benefit. 1 Our study pools patients with melanoma from international cohorts and supports the previous finding that eTIL% score is an independent prognostic marker in melanoma and shows that the effect is seen only in low stage patients, the population containing the stage 2 patients that receive adjuvant therapy.
A highly reproducible estimate of the TIL calculation is needed to use the machine learning TIL score in the clinic. Unlike the NN192 algorithm, which relies on cell detection and classification, there are many machine learning models that train with patches rather than with cell detection to generate a TIL map that characterises lymphocytic infiltrates in intra-tumoral, peri-tumoral, and adjacent stromal regions. 23 To identify the best approach for calculating TILs, we considered using multiple variables based on both the cell types (tumor cells, immune cells, fibroblasts, or other cells) and the area of interest (tumor and adjacent 1,2 mm diameter stroma). The test of five TIL variables in the low stage multi-institutional combined cohort showed that all TIL variables had comparable prognostic value; but eTIL% and etTILs % had the best performance. We note that TILs can also be predictive for immunotherapy, but evaluation of these variables in the context of immunotherapy is beyond the scope of this work.
A key limitation of our work is that we have used a machine learning algorithm that is susceptible to cell assignment error during cell classification. Another limitation is that melanoma cells have uneven membrane patterns and highly irregular cell shapes. Since melanoma cells change their shape and can adopt the appearance of other cells to invade any tissue in the body, 24 it is often difficult to accurately classify tumor cells resulting in misclassification of malignant melanoma cells as the fibroblasts or other cells. As etTIL% was calculated using total cells as the denominator, the likelihood of incorrect percentage automated TIL calculation due to assignment error might be reduced as opposed to eTIL% method, which used the sum of only immune and tumor cells as its denominator. As a result, etTIL% may have higher reproducibility than eTIL% in patients with melanoma. However, we found no statistically significant difference between our twobest model variables. Finally, the use of 0.6-mm diameter tissue cores TMAs is an additional limitation of this study. The advantage of large numbers of cases accessible by TMA is the trade-off between assessing the fraction of the tumor and maximizing the number of samples. As such, the evaluation of TILs in both TMA and WTA formats show significant association with survival. Further we note the the TILs quantification done here is to illustrate the heterogeneity of the immune infiltrate that is recognized as eTIL. We do not attempt to validation the prognostic value of molecularly defined TIL in this work.
TILs comprise a heterogeneous cell population including natural killer (NK) cells, B cells and various subsets of T cells with complex functional states (e.g., naive, effector, memory, and dysfunctional) and thus, the prognostic value of the automated TIL may arise from the unbiased combination of distinct molecular subtypes. 25 Here, we evaluated various molecular subtypes of lymphocytes within the TILs population and their prognostic values assessed by DSS. Concordant with the study by Piras, et al., 26 our results show that high CD8+ and CD3+ cells were associated with favorable DSS though the hazard ratio was higher than that of eTIL% and etTIL% variables (0.2 versus 0.13 and 0.075). In the study of Acs, et al., only a weak-fair correlation was reported between eTILs and CD4 and CD8 expression. 1 The inherently subjective nature of the user-supervised training process for cell segmentation of IF images using cell segmentation algorithm might be a major contributor to this variation. The cell count analyses using IF cell segmentation platforms can lead to inconsistent outcomes, especially if the assessed cohort is not statistically powered. The provisional solution might be to use a combined application of both IF and H & E that are sufficiently generic to be easily trainable while consistently achieving high sensitivity and specificity with validation.
In summary, we validated that eTIL% score is a robust prognostic marker in patients with early-stage melanoma and identified distinct TIL subpopulations that carry the prognostic value. Pending prospective validation, the use of the NN192 machine learning algorithm might evolve into a useful and easy-to-implement tool that will aid in risk stratification of patients with early-stage melanoma. In the future, the use of eTILs might be complemented with molecular subtyping of cells for more discriminating analyses. The use of a combined marker signature may be proven to be the best approach to define a subset of patients that will not benefit from immunotherapy or might develop significant toxicities.